This is to certify that the work I am submitting is my own. All external references and sources are clearly acknowledged and identified within the contents. I am aware of the University of Warwick regulation concerning plagiarism and collusion.
No substantial part(s) of the work submitted here has also been submitted by me in other assessments for accredited courses of study, and I acknowledge that if this has been done an appropriate reduction in the mark I might otherwise have received will be made.
# install.packages('Hmisc')
# install.packages("ggpubr")
library(tidyverse)
library(emmeans) # for emmeans(), contrast(), pairs()
library(gridExtra)
library(knitr)
library(kableExtra)
library(Hmisc) # for correlation functions
library(ggpubr) # for ggarrange
options(width=100)
# Read data and reformat data from Characters to Factors
food_df <- read.csv("2019-20-enforcement-data-food-hygiene.csv", stringsAsFactors = TRUE)
# Structure
#str(food_df)
# Summary
#summary(food_df)
# Select the variables needed and create a new dataframe
new_food_df <- food_df[, c(1:6, 19:24, 36)]
# Create a new column: subtract establishments not yet rated and those outside the programme, since they are not part of the analysis.
new_food_df <- new_food_df %>% mutate(EstratedforIntv = food_df$Totalestablishments.includingnotyetrated.outside. - food_df$Establishmentsnotyetratedforintervention - food_df$Establishmentsoutsidetheprogramme)
str(new_food_df)
## 'data.frame': 353 obs. of 14 variables:
## $ Country : Factor w/ 3 levels "England","Northern Ireland",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LAType : Factor w/ 6 levels "District Council",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LAName : Factor w/ 353 levels "Adur and Worthing ",..: 1 2 3 8 9 10 11 12 16 17 ...
## $ Totalestablishments.includingnotyetrated.outside.: int 1478 1316 1112 1208 905 1132 1746 2002 574 1400 ...
## $ Establishmentsnotyetratedforintervention : int 24 29 1 44 26 0 58 40 41 84 ...
## $ Establishmentsoutsidetheprogramme : int 0 74 0 1 1 0 214 39 0 42 ...
## $ Total.ofInterventionsachieved.premisesratedA.E. : num 96.1 90.6 88.9 94 80.7 ...
## $ Total.ofInterventionsachieved.premisesratedA : Factor w/ 30 levels "","100","33.33",..: 2 2 2 2 5 8 2 2 2 2 ...
## $ Total.ofInterventionsachieved.premisesratedB : num 100 98.3 95.1 96.3 100 ...
## $ Total.ofInterventionsachieved.premisesratedC : num 95.5 89.7 97 94.4 78.8 ...
## $ Total.ofInterventionsachieved.premisesratedD : num 96 93 91.8 92.6 85.3 ...
## $ Total.ofInterventionsachieved.premisesratedE : num 94 85.1 72.3 95.5 68.3 ...
## $ ProfessionalFullTimeEquivalentPosts.occupied.. : num 5 4 3.5 4 2 4.65 2.5 5 2 4.2 ...
## $ EstratedforIntv : int 1454 1213 1111 1163 878 1132 1474 1923 533 1274 ...
summary(new_food_df)
## Country LAType LAName
## England :320 District Council :194 Adur and Worthing : 1
## Northern Ireland: 11 London Borough : 33 Allerdale : 1
## Wales : 22 Metropolitan Borough Council: 37 Amber Valley : 1
## NI Unitary Authority : 11 Anglesey : 1
## Unitary Authority : 56 Antrim and Newtownabbey: 1
## Welsh Unitary Authority : 22 Ards and North Down : 1
## (Other) :347
## Totalestablishments.includingnotyetrated.outside. Establishmentsnotyetratedforintervention
## Min. : 145.0 Min. : 0.00
## 1st Qu.: 920.5 1st Qu.: 25.00
## Median :1330.0 Median : 49.00
## Mean :1620.7 Mean : 89.75
## 3rd Qu.:2004.5 3rd Qu.: 100.00
## Max. :9277.0 Max. :1744.00
## NA's :6 NA's :6
## Establishmentsoutsidetheprogramme Total.ofInterventionsachieved.premisesratedA.E.
## Min. : 0.00 Min. : 20.64
## 1st Qu.: 0.00 1st Qu.: 81.81
## Median : 2.00 Median : 90.82
## Mean : 49.62 Mean : 86.62
## 3rd Qu.: 39.00 3rd Qu.: 95.39
## Max. :865.00 Max. :100.00
## NA's :6 NA's :6
## Total.ofInterventionsachieved.premisesratedA Total.ofInterventionsachieved.premisesratedB
## 100 :282 Min. : 50.00
## NR : 24 1st Qu.: 93.53
## : 6 Median : 97.75
## 50 : 3 Mean : 95.25
## 80 : 3 3rd Qu.:100.00
## 88.89 : 3 Max. :100.00
## (Other): 32 NA's :6
## Total.ofInterventionsachieved.premisesratedC Total.ofInterventionsachieved.premisesratedD
## Min. : 18.37 Min. : 19.77
## 1st Qu.: 89.27 1st Qu.: 82.00
## Median : 94.97 Median : 91.72
## Mean : 91.84 Mean : 86.32
## 3rd Qu.: 97.77 3rd Qu.: 95.98
## Max. :100.00 Max. :100.00
## NA's :6 NA's :6
## Total.ofInterventionsachieved.premisesratedE ProfessionalFullTimeEquivalentPosts.occupied..
## Min. : 1.81 Min. : 0.65
## 1st Qu.: 65.39 1st Qu.: 2.50
## Median : 87.50 Median : 3.41
## Mean : 77.37 Mean : 4.10
## 3rd Qu.: 95.90 3rd Qu.: 5.00
## Max. :100.00 Max. :22.13
## NA's :6 NA's :6
## EstratedforIntv
## Min. : 144.0
## 1st Qu.: 878.5
## Median :1225.0
## Mean :1481.3
## 3rd Qu.:1788.0
## Max. :8541.0
## NA's :6
This report presents the results of the analyses requested by the Food Standards Agency, using the data provided for 353 local authorities.
First, we examine the distribution across the Local Authorities (LAs) of the % of enforcement actions successfully achieved. The histogram is left-skewed, with a mean of 86.62%, showing that most enforcement actions are successfully achieved.
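The histogram is not echoed in the output; a minimal sketch of what the plotting code likely looks like is below. Simulated stand-in values are used so the snippet runs on its own; the real analysis plots `Total.ofInterventionsachieved.premisesratedA.E.` from `new_food_df`.

```r
library(ggplot2)

# Simulated stand-in for the overall success column (347 complete values;
# the real data are Total.ofInterventionsachieved.premisesratedA.E.)
set.seed(1)
demo <- data.frame(success = pmin(100, rnorm(347, mean = 86.6, sd = 10)))

p <- ggplot(demo, aes(x = success)) +
  geom_histogram(binwidth = 2) +
  labs(x = "% of enforcement actions achieved (A-E)", y = "Number of LAs")
```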
The following graph shows the overall distribution across LAs by Country. England contributes the majority of LAs, and most of the Northern Ireland actions are successfully achieved.
The following graph shows the distributions for establishments rated A to E across LAs by Country. All are left-skewed histograms, with shapes similar to the overall (A to E) distribution.
The following graph shows the overall distribution across LAs by LAType. District Council is the most common type, and Welsh Unitary Authority has the fewest cases.
The following graph shows the distributions for establishments rated A to E across LAs by LAType.
Next, we plot the regression line to examine the relationship between FTE food safety employees and the proportion of successful responses.
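The scatter-plus-regression code is not echoed; a hedged sketch is shown below with simulated stand-in data so it runs on its own. The real plot uses `ProfessionalFullTimeEquivalentPosts.occupied..` and `Total.ofInterventionsachieved.premisesratedA.E.` from `new_food_df`.

```r
library(ggplot2)

# Simulated stand-in for the FTE posts and overall success % columns
set.seed(1)
demo <- data.frame(fte = runif(347, 0.5, 22))
demo$success <- 87 - 0.12 * demo$fte + rnorm(347, sd = 12)

p <- ggplot(demo, aes(x = fte, y = success)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = lm) +     # linear trend line, as in the report
  labs(x = "FTE food safety employees", y = "% successful responses")
```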
## `geom_smooth()` using formula = 'y ~ x'
We then look at the relationship between FTE food safety employees and the proportion of successful responses for establishments rated A to E.
ggarrange(scresp.by.propemlyest.A, scresp.by.propemlyest.B, scresp.by.propemlyest.C, scresp.by.propemlyest.D, scresp.by.propemlyest.E)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 30 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 30 rows containing missing values (`geom_point()`).
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 6 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 6 rows containing missing values (`geom_point()`).
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 6 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 6 rows containing missing values (`geom_point()`).
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 6 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 6 rows containing missing values (`geom_point()`).
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 6 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 6 rows containing missing values (`geom_point()`).
Overall (A to E): for every extra FTE food safety employee, a 0.12 lower proportion of successful responses is achieved, 95% CI [-0.65, 0.41]. FTE food safety employees explain 0.06% of the variance in the proportion of successful responses.
A: for every extra FTE food safety employee, a 0.15 greater proportion of successful responses is achieved, 95% CI [-0.18, 0.47]; FTE employees explain 0.25% of the variance.
B: for every extra FTE food safety employee, a 0.27 greater proportion of successful responses is achieved, 95% CI [-0.02, 0.56]; FTE employees explain 0.98% of the variance.
C: for every extra FTE food safety employee, a 0.02 lower proportion of successful responses is achieved, 95% CI [-0.42, 0.37]; FTE employees explain <0.01% of the variance.
D: for every extra FTE food safety employee, a 1.11 lower proportion of successful responses is achieved, 95% CI [-1.72, -0.51]; FTE employees explain 3.69% of the variance.
E: for every extra FTE food safety employee, a 0.31 greater proportion of successful responses is achieved, 95% CI [-0.72, 1.33]; FTE employees explain 0.10% of the variance.
According to the analysis above, there is no significant relationship between FTE food safety employees and the proportion of successful responses: the slope is not significantly different from zero, t(345) = -0.447, p = .655. Therefore, it is not recommended to use FTE food safety employees to predict the proportion of successful responses.
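The model behind these numbers is a simple `lm()`; the hedged sketch below (simulated stand-in data, since the CSV is not bundled here) shows how the slope, its 95% CI, and the variance explained are read off a fit of this form.

```r
# Simulated stand-in; the report regresses the overall success % on FTE posts
set.seed(42)
demo <- data.frame(fte = runif(347, 0.5, 22))
demo$success <- 87 - 0.12 * demo$fte + rnorm(347, sd = 12)

m.fte <- lm(success ~ fte, data = demo)
cbind(coefficient = coef(m.fte), confint(m.fte))  # slope estimate and 95% CI
summary(m.fte)$r.squared                          # proportion of variance explained
```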
Next, the regression line below examines the relationship between the proportion of successful responses and the number of employees as a proportion of the number of establishments in the local authority.
## `geom_smooth()` using formula = 'y ~ x'
We then look at the relationship between the proportion of successful responses and employees as a proportion of establishments, separately for establishments rated A to E.
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
From the fitted models: Overall (A to E): for every extra unit of employees as a proportion of establishments, a 2407.40 greater proportion of successful responses is achieved, 95% CI [1193.72, 3621.06]; this predictor explains 4.22% of the variance in the proportion of successful responses.
A: a 533.30 lower proportion of successful responses per extra unit, 95% CI [-1290.00, 223.40]; 0.60% of the variance explained.
B: a 128.59 greater proportion per extra unit, 95% CI [-555.06, 812.25]; 0.04% of the variance explained.
C: a 1169.56 greater proportion per extra unit, 95% CI [250.03, 2089.01]; 1.78% of the variance explained.
D: a 1736.63 greater proportion per extra unit, 95% CI [302.88, 3170.38]; 1.62% of the variance explained.
E: a 4444.40 greater proportion per extra unit, 95% CI [2080.12, 6808.69]; 3.81% of the variance explained.
According to the analysis above, there is a significant relationship between the number of employees as a proportion of the number of establishments rated (overall) and the proportion of successful responses: the slope is significantly different from zero, t(345) = 3.90, p < .001. Therefore, it is reasonable to use the number of employees as a proportion of establishments rated to predict the proportion of successful responses.
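The derived predictor here is FTE posts divided by the number of establishments rated for intervention; a hedged sketch of its construction and the fit is below (simulated stand-in data; the report computes it from `ProfessionalFullTimeEquivalentPosts.occupied..` and `EstratedforIntv`).

```r
set.seed(7)
demo <- data.frame(fte   = runif(347, 0.5, 22),
                   rated = round(runif(347, 150, 8500)))

# Employees as a proportion of establishments rated for intervention
demo$prop.employees <- demo$fte / demo$rated
demo$success <- 70 + 2400 * demo$prop.employees + rnorm(347, sd = 12)

m.prop <- lm(success ~ prop.employees, data = demo)
cbind(coefficient = coef(m.prop), confint(m.prop))  # slope and 95% CI
```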
# install.packages('Hmisc')
library(tidyverse)
library(emmeans)
library(gridExtra)
library(knitr)
library(kableExtra)
library(Hmisc)
library(RColorBrewer)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
options(width=100)
# Import dataset and transform character variables into factor
pub_df <- read.csv('publisher_sales.csv', stringsAsFactors = TRUE)
# Summarise data for more information and check data types to ensure appropriateness
str(pub_df)
## 'data.frame': 6000 obs. of 7 variables:
## $ sold.by : Factor w/ 13 levels "Amazon Digital Services, Inc.",..: 11 1 1 1 13 13 1 6 1 1 ...
## $ publisher.type: Factor w/ 5 levels "amazon","big five",..: 2 3 5 5 2 2 5 2 5 5 ...
## $ genre : Factor w/ 3 levels "childrens","fiction",..: 1 3 3 2 1 1 2 1 2 1 ...
## $ avg.review : num 4.44 4.19 3.71 4.72 4.65 4.81 4.33 4.21 3.95 4.66 ...
## $ daily.sales : num 61.5 74.9 66 85.2 37.7 ...
## $ total.reviews : int 92 130 118 179 111 106 205 86 161 81 ...
## $ sale.price : num 8.03 9.08 9.48 12.32 5.78 ...
summary(pub_df)
## sold.by publisher.type genre
## Amazon Digital Services, Inc. :4311 amazon : 78 childrens :2000
## Random House LLC : 442 big five :1658 fiction :2000
## Penguin Group (USA) LLC : 307 indie :1330 non_fiction:2000
## Simon and Schuster Digital Sales Inc: 272 single author: 809
## HarperCollins Publishers : 261 small/medium :2125
## Macmillan : 175
## (Other) : 232
## avg.review daily.sales total.reviews sale.price
## Min. :0.000 Min. : -0.53 Min. : 0.0 Min. : 0.740
## 1st Qu.:4.100 1st Qu.: 56.77 1st Qu.:105.0 1st Qu.: 7.140
## Median :4.400 Median : 74.29 Median :128.0 Median : 8.630
## Mean :4.267 Mean : 79.11 Mean :132.6 Mean : 8.641
## 3rd Qu.:4.620 3rd Qu.: 98.02 3rd Qu.:163.0 3rd Qu.:10.160
## Max. :4.980 Max. :207.98 Max. :248.0 Max. :17.460
##
# Remove a record with an unreasonable (negative) daily.sales value
pub_df <- pub_df %>% filter(daily.sales >= 0)
summary(pub_df)
## sold.by publisher.type genre
## Amazon Digital Services, Inc. :4310 amazon : 78 childrens :2000
## Random House LLC : 442 big five :1658 fiction :2000
## Penguin Group (USA) LLC : 307 indie :1329 non_fiction:1999
## Simon and Schuster Digital Sales Inc: 272 single author: 809
## HarperCollins Publishers : 261 small/medium :2125
## Macmillan : 175
## (Other) : 232
## avg.review daily.sales total.reviews sale.price
## Min. :0.000 Min. : 3.49 Min. : 0.0 Min. : 0.740
## 1st Qu.:4.100 1st Qu.: 56.78 1st Qu.:105.0 1st Qu.: 7.140
## Median :4.400 Median : 74.30 Median :128.0 Median : 8.630
## Mean :4.267 Mean : 79.12 Mean :132.6 Mean : 8.641
## 3rd Qu.:4.620 3rd Qu.: 98.02 3rd Qu.:163.0 3rd Qu.:10.160
## Max. :4.980 Max. :207.98 Max. :248.0 Max. :17.460
##
# Confirm there are no NA values remaining
sum(is.na(pub_df))
## [1] 0
One-Way ANOVA
pub_df %>% group_by(genre) %>% summarise(n())
## # A tibble: 3 × 2
## genre `n()`
## <fct> <int>
## 1 childrens 2000
## 2 fiction 2000
## 3 non_fiction 1999
m.sales.by.genre <- lm(daily.sales~genre, data=pub_df)
anova(m.sales.by.genre)
## Analysis of Variance Table
##
## Response: daily.sales
## Df Sum Sq Mean Sq F value Pr(>F)
## genre 2 2562023 1281012 2594.7 < 2.2e-16 ***
## Residuals 5996 2960293 494
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
( m.sales.by.genre.emm <- emmeans(m.sales.by.genre, ~genre) )
## genre emmean SE df lower.CL upper.CL
## childrens 55.6 0.497 5996 54.6 56.6
## fiction 105.9 0.497 5996 104.9 106.9
## non_fiction 75.9 0.497 5996 74.9 76.9
##
## Confidence level used: 0.95
( m.sales.by.genre.pairs <- confint(pairs(m.sales.by.genre.emm)) )
## contrast estimate SE df lower.CL upper.CL
## childrens - fiction -50.3 0.703 5996 -52.0 -48.7
## childrens - non_fiction -20.3 0.703 5996 -22.0 -18.7
## fiction - non_fiction 30.0 0.703 5996 28.3 31.6
##
## Confidence level used: 0.95
## Conf-level adjustment: tukey method for comparing a family of 3 estimates
Presenting a One-Way ANOVA
p.sales <- ggplot(summary(m.sales.by.genre.emm), aes(x=genre, y=emmean, ymin=lower.CL, ymax=upper.CL)) + geom_point() + geom_linerange() + labs(x="Genres ", y="Daily Sales", title = "CI for Daily Sales by Genres", subtitle="Error Bars are Extent of 95% CIs")
p.contrasts <- ggplot(m.sales.by.genre.pairs, aes(x=contrast, y=estimate, ymin=lower.CL, ymax=upper.CL)) + geom_point() + geom_linerange() + geom_hline(yintercept=0, lty=2) + labs(x="Contrast", y="Difference in Average Daily Sales", title = "CI for Daily Sales by \n Difference between Genres", subtitle="Error Bars are Extent of 95% CIs") + theme(axis.text.x = element_text(angle = 10))
grid.arrange(p.sales, p.contrasts, ncol=2)
# Predicting average daily sales from both average review score and total reviews
m.sales.by.avg.total.review <- lm(daily.sales ~ avg.review + total.reviews, data = pub_df)
summary(m.sales.by.avg.total.review)
##
## Call:
## lm(formula = daily.sales ~ avg.review + total.reviews, data = pub_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.407 -14.656 -1.071 13.672 122.177
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.123430 2.340120 10.309 < 2e-16 ***
## avg.review -3.999637 0.512874 -7.798 7.34e-15 ***
## total.reviews 0.543327 0.007816 69.517 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.58 on 5996 degrees of freedom
## Multiple R-squared: 0.4463, Adjusted R-squared: 0.4461
## F-statistic: 2416 on 2 and 5996 DF, p-value: < 2.2e-16
# Check multicollinearity again
vif(m.sales.by.avg.total.review)
## avg.review total.reviews
## 1.011022 1.011022
# Show the effect of one variable with the other variables held constant
sales.preds <- as_tibble(expand.grid(avg.review = seq(0, 5, 1), total.reviews = seq(0, 250, 5)))
sales.preds <- mutate(sales.preds, Sales.hat = predict(m.sales.by.avg.total.review, sales.preds))
ggplot(sales.preds, aes(avg.review, total.reviews)) + geom_contour_filled(aes(z = Sales.hat)) + guides(fill=guide_legend(title="Sales"))
# Get the confidence intervals for the estimation approach
cbind(coefficient=coef(m.sales.by.avg.total.review), confint(m.sales.by.avg.total.review))
## coefficient 2.5 % 97.5 %
## (Intercept) 24.123430 19.5359532 28.7109077
## avg.review -3.999637 -5.0050539 -2.9942200
## total.reviews 0.543327 0.5280054 0.5586487
# interaction
m.intr <- lm(daily.sales ~ avg.review*total.reviews, data = pub_df)
summary(m.intr)
##
## Call:
## lm(formula = daily.sales ~ avg.review * total.reviews, data = pub_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104.084 -14.641 -0.946 13.822 92.351
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.676679 4.174384 15.254 < 2e-16 ***
## avg.review -13.709806 0.992277 -13.817 < 2e-16 ***
## total.reviews 0.165853 0.034038 4.873 1.13e-06 ***
## avg.review:total.reviews 0.091422 0.008028 11.388 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.34 on 5995 degrees of freedom
## Multiple R-squared: 0.458, Adjusted R-squared: 0.4577
## F-statistic: 1689 on 3 and 5995 DF, p-value: < 2.2e-16
intr.surf <- as_tibble(expand.grid(avg.review = seq(0, 5, 1), total.reviews = seq(0, 250, 5)))
intr.surf <- mutate(intr.surf, intr.hat = predict(m.intr, intr.surf))
ggplot(intr.surf, aes(avg.review, total.reviews)) + geom_contour_filled(aes(z = intr.hat)) + labs(subtitle = "Interaction Effects of Avg.review and Total.reviews") + guides(fill=guide_legend(title="Sales"))
# Calculate correlation coefficients
rcorr(as.matrix(select(pub_df, daily.sales, sale.price)))
## daily.sales sale.price
## daily.sales 1.00 -0.28
## sale.price -0.28 1.00
##
## n= 5999
##
##
## P
## daily.sales sale.price
## daily.sales 0
## sale.price 0
ggplot(pub_df, aes(x=sale.price, y=daily.sales, alpha = 0.1)) + geom_point() + labs(y="Number of Sales Daily", x="Sale Price", title=expression(r==-0.28), subtitle="The effect of sale price upon the number of sales") + geom_smooth(method=lm)
## `geom_smooth()` using formula = 'y ~ x'
m.dsales.by.price <- lm(daily.sales~sale.price, data=pub_df)
summary(m.dsales.by.price)
##
## Call:
## lm(formula = daily.sales ~ sale.price, data = pub_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.760 -20.644 -4.638 17.084 130.301
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 112.0540 1.5201 73.72 <2e-16 ***
## sale.price -3.8110 0.1704 -22.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.15 on 5997 degrees of freedom
## Multiple R-squared: 0.07696, Adjusted R-squared: 0.0768
## F-statistic: 500 on 1 and 5997 DF, p-value: < 2.2e-16
# Get the confidence intervals for the estimation approach
cbind(coefficient = coef(m.dsales.by.price), confint(m.dsales.by.price))
## coefficient 2.5 % 97.5 %
## (Intercept) 112.054023 109.074077 115.033968
## sale.price -3.810984 -4.145101 -3.476867
Is this (the effect of sale price upon the number of sales) different across genres?
ggplot(pub_df, aes(x=sale.price, y=daily.sales, alpha = 0.1)) + geom_point() + labs(y="Number of Sales Daily", x="Sale Price", subtitle="The effect of sale price upon the number of sales") + geom_smooth(method=lm) + facet_wrap(~ genre)
## `geom_smooth()` using formula = 'y ~ x'
# Filter by genre
child_df <- pub_df %>%
filter(genre == "childrens")
fic_df <- pub_df %>%
filter(genre == "fiction")
nonf_df <- pub_df %>%
filter(genre == "non_fiction")
# Calculate correlation coefficients
rcorr(as.matrix(select(child_df, daily.sales, sale.price)))
## daily.sales sale.price
## daily.sales 1.00 -0.23
## sale.price -0.23 1.00
##
## n= 2000
##
##
## P
## daily.sales sale.price
## daily.sales 0
## sale.price 0
rcorr(as.matrix(select(fic_df, daily.sales, sale.price)))
## daily.sales sale.price
## daily.sales 1.00 -0.02
## sale.price -0.02 1.00
##
## n= 2000
##
##
## P
## daily.sales sale.price
## daily.sales 0.4203
## sale.price 0.4203
rcorr(as.matrix(select(nonf_df, daily.sales, sale.price)))
## daily.sales sale.price
## daily.sales 1.00 -0.04
## sale.price -0.04 1.00
##
## n= 1999
##
##
## P
## daily.sales sale.price
## daily.sales 0.0539
## sale.price 0.0539
# Build linear regression models
m.child.dsales.by.price <- lm(daily.sales~sale.price, data=child_df)
summary(m.child.dsales.by.price)
##
## Call:
## lm(formula = daily.sales ~ sale.price, data = child_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.929 -9.503 -0.030 9.585 56.579
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.8781 1.6336 44.61 <2e-16 ***
## sale.price -1.7319 0.1603 -10.80 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.46 on 1998 degrees of freedom
## Multiple R-squared: 0.0552, Adjusted R-squared: 0.05472
## F-statistic: 116.7 on 1 and 1998 DF, p-value: < 2.2e-16
m.fic.dsales.by.price <- lm(daily.sales~sale.price, data=fic_df)
summary(m.fic.dsales.by.price)
##
## Call:
## lm(formula = daily.sales ~ sale.price, data = fic_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -102.383 -20.066 0.179 20.095 102.366
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 108.0774 2.7970 38.640 <2e-16 ***
## sale.price -0.2732 0.3389 -0.806 0.42
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.34 on 1998 degrees of freedom
## Multiple R-squared: 0.000325, Adjusted R-squared: -0.0001753
## F-statistic: 0.6496 on 1 and 1998 DF, p-value: 0.4203
m.nonf.by.price <- lm(daily.sales~sale.price, data=nonf_df)
summary(m.nonf.by.price)
##
## Call:
## lm(formula = daily.sales ~ sale.price, data = nonf_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.526 -13.381 -0.048 13.322 86.960
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.2755 1.8043 43.936 <2e-16 ***
## sale.price -0.4262 0.2210 -1.929 0.0539 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.97 on 1997 degrees of freedom
## Multiple R-squared: 0.001859, Adjusted R-squared: 0.001359
## F-statistic: 3.719 on 1 and 1997 DF, p-value: 0.05393
# Get the confidence intervals for the estimation approach
cbind(coef(m.child.dsales.by.price), confint(m.child.dsales.by.price))
## 2.5 % 97.5 %
## (Intercept) 72.878117 69.674294 76.081941
## sale.price -1.731864 -2.046235 -1.417494
cbind(coef(m.fic.dsales.by.price), confint(m.fic.dsales.by.price))
## 2.5 % 97.5 %
## (Intercept) 108.0773903 102.5919734 113.5628073
## sale.price -0.2731555 -0.9378044 0.3914934
cbind(coef(m.nonf.by.price), confint(m.nonf.by.price))
## 2.5 % 97.5 %
## (Intercept) 79.2754689 75.7369156 82.81402222
## sale.price -0.4262021 -0.8596162 0.00721202
This report presents the results of the analyses of the publishing company's dataset. The data contain 6000 sales records in total. One inaccurate record (a negative daily sales value) was found and removed, leaving 5999 records.
First, we look at the summary of the sales records by genre.

| genre | n() |
|---|---|
| childrens | 2000 |
| fiction | 2000 |
| non_fiction | 1999 |
A one-way ANOVA tests whether genre has a significant effect on daily sales. We can say: "Daily sales differ significantly across the genres of books, F(2, 5996) = 2594.7, p < .001."
It was also found that fiction books have the highest average daily sales (105.9), followed by non-fiction (75.9), with children's books lowest (55.6). The estimated differences in daily sales for each pair of genres are shown in the right panel below. The childrens-fiction comparison has a point estimate of 50.3 more daily sales for fiction, the largest difference of the three pairs, with a 95% CI of 48.7 to 52.0 greater sales for fiction. The childrens-non_fiction comparison has a point estimate of 20.3 more daily sales for non-fiction, with a 95% CI of 18.7 to 22.0. The fiction-non_fiction comparison has a point estimate of 30.0 more daily sales for fiction, with a 95% CI of 28.3 to 31.6.
Next, a model was built to look into whether books have more sales depending upon their average review scores and total number of reviews.
From the results, average daily sales decrease by a statistically significant 4.00, t(5996) = -7.80, p < .0001, for every one-point increase in average review score, holding the total number of reviews constant.
Controlling for average review score, average daily sales increase by 0.54 for every extra review, which is significantly different from zero, t(5996) = 69.52, p < .0001.
We also report the confidence intervals from the estimation approach, since they carry more information: average daily sales decrease by 4.00, 95% CI [-5.00, -3.00], per extra point of average review score, and increase by 0.54, 95% CI [0.53, 0.56], per extra review.
Sometimes the effect of a predictor differs depending upon the value of another variable. We therefore build a new model with an interaction between average review score and the total number of reviews. The result shows a significant positive interaction between them: the higher the average review score, the more each additional review increases sales.
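Using the coefficients from `summary(m.intr)` above, the slope of `total.reviews` at a given average review score can be computed directly, which makes the interaction concrete:

```r
# Coefficients taken from summary(m.intr) above
b.total <- 0.165853   # total.reviews main effect
b.intr  <- 0.091422   # avg.review:total.reviews interaction

# Marginal effect of one extra review at a given average review score
slope_at <- function(avg.review) b.total + b.intr * avg.review
slope_at(c(0, 2.5, 5))   # roughly 0.17, 0.39, 0.62 extra daily sales per review
```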
## `geom_smooth()` using formula = 'y ~ x'
Next, a model was built to look at the effect a unit change in sale price (on the x axis) has on the number of sales. From the result, the number of sales decreases by 3.81 for every extra unit of sale price, and sale price explains 7.7% of the variance in average daily sales. The decrease is significantly different from zero, t(5997) = -22.36, p < .0001. Therefore, it is reasonable to use sale price to predict average daily sales.
## `geom_smooth()` using formula = 'y ~ x'
We can see that the correlations between sale price and average daily sales are negative across all genres, indicating that higher sale prices are associated with fewer sales regardless of genre; however, the relationships are weak. For every extra unit of sale price, children's books sell 1.73 fewer copies per day, 95% CI [-2.05, -1.42], with sale price explaining 5.52% of the variance in daily sales; fiction books sell 0.27 fewer, 95% CI [-0.94, 0.39], explaining 0.03% of the variance; non-fiction books sell 0.43 fewer, 95% CI [-0.86, 0.01], explaining 0.19% of the variance.
Based on the fitted linear models, only for children's books is the effect of sale price on average daily sales significant, t(1998) = -10.8, p < .0001; the effect is not significant for the other two genres.
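The per-genre fits above can also be checked with a single model interacting `sale.price` with `genre` (a sketch, not part of the original analysis; simulated stand-in data so the snippet runs on its own):

```r
# Simulated stand-in for pub_df with a genre-dependent price slope
set.seed(3)
demo <- data.frame(genre = factor(rep(c("childrens", "fiction", "non_fiction"),
                                      each = 200)),
                   sale.price = runif(600, 1, 17))
demo$daily.sales <- 80 -
  ifelse(demo$genre == "childrens", 1.7, 0.3) * demo$sale.price +
  rnorm(600, sd = 15)

m.price.genre <- lm(daily.sales ~ sale.price * genre, data = demo)
anova(m.price.genre)  # the interaction row tests whether slopes differ by genre
```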